1. Field of the Invention
The present invention relates to an optical character reading apparatus for reading characters of general documents such as English texts, and checking the spelling of the read documents.
2. Description of the Related Art
In recent years, optical character reading apparatuses (OCRs) for reading characters of general documents such as English texts have been developed and are beginning to be put to practical use. In such an OCR, the format of a document to be read is not limited, and a variety of documents can be read.
The OCR has a function of checking character recognition results in units of words so as to confirm and correct the reading results. In the OCR having the spelling checking function, it is important to correctly segment words to be checked from the character recognition results so as to improve accuracy.
In general, words are segmented on the basis of physical divisions or boundaries between words recorded on a paper sheet. More specifically, the spaces between adjacent characters are checked, and if a space equal to or larger than a predetermined value is found, it is determined to be a division between words. In a general English document, however, the original division of words often differs from the division of words as printed on a paper sheet.
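The space-based segmentation described above can be illustrated with a minimal sketch. The character representation (a character with its left and right horizontal coordinates), the function names `segment_words` and `layout`, and the numeric widths are all assumptions made for illustration; they are not taken from any particular OCR.

```python
from typing import List, Tuple

# Hypothetical representation of one recognized character:
# (character, left edge, right edge) in arbitrary horizontal units.
Char = Tuple[str, float, float]


def segment_words(chars: List[Char], min_gap: float) -> List[str]:
    """Start a new word whenever the gap to the previous character
    is equal to or larger than min_gap (the predetermined value)."""
    words: List[str] = []
    current = ""
    prev_right = None
    for ch, left, right in chars:
        if prev_right is not None and left - prev_right >= min_gap:
            words.append(current)
            current = ""
        current += ch
        prev_right = right
    if current:
        words.append(current)
    return words


def layout(text: str, char_width: float = 8.0, char_gap: float = 2.0,
           space_width: float = 10.0) -> List[Char]:
    """Illustrative helper: place the characters of one printed line on a
    horizontal axis so that a blank produces a wide physical gap."""
    chars: List[Char] = []
    x = 0.0
    for ch in text:
        if ch == " ":
            x += space_width
            continue
        chars.append((ch, x, x + char_width))
        x += char_width + char_gap
    return chars


print(segment_words(layout("an up-to-date text"), min_gap=6.0))
# prints ['an', 'up-to-date', 'text']
```

Because the hyphens in "up-to-date" are spaced like ordinary characters, the compound word comes out as a single word, which is the behavior the following paragraph describes.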
For example, a word formed by connecting a plurality of independent words with hyphens (to be referred to as a compound word hereinafter) is segmented as one word. Conversely, a word that falls at a line end position of a paper sheet and is therefore divided, for printing convenience, between the end of that line and the beginning of the next line (to be referred to as a separated word hereinafter) is segmented as two words. In general, a separated word is recorded with a hyphen added to the character string (a portion of the word) at the line end position.
Such compound words and separated words are not registered in the dictionary prepared in advance for the spelling checking function. The words constituting a compound word should originally be spelling-checked independently, but the compound word is checked as a whole. In the case of a separated word, character strings different from the original word are spelling-checked. Therefore, compound words and separated words cannot be correctly spelling-checked.
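The failure described above can be reproduced with a small sketch. The dictionary contents, the sample lines, and the function name `naive_check` are hypothetical, and splitting each line on blanks stands in for the physical segmentation of a real OCR.

```python
# Hypothetical dictionary holding only ordinary, independent words.
DICTIONARY = {"the", "report", "contains", "up", "to", "date", "information"}


def naive_check(lines):
    """Check each physically segmented token against the dictionary and
    return the tokens that are not found."""
    flagged = []
    for line in lines:
        for token in line.split():
            # Drop trailing punctuation, but keep hyphens as printed.
            word = token.strip(".,;:").lower()
            if word and word not in DICTIONARY:
                flagged.append(token)
    return flagged


# "information" is divided across the line end with a hyphen.
lines = ["The report contains up-to-date infor-", "mation."]
print(naive_check(lines))
# prints ['up-to-date', 'infor-', 'mation.']
```

Every word in the sample is spelled correctly, yet the compound word and both portions of the separated word are reported as errors because none of these character strings is registered in the dictionary.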
In order to solve this problem, all words including compound words and separated words may be registered in a dictionary. However, since the number of words becomes huge in this case, a very large capacity for storing a dictionary is required. For this reason, this method is not practical.
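The growth in dictionary size can be made concrete with a rough sketch on a toy vocabulary. The function name `enumerate_extra_entries` is hypothetical, and the sketch assumes that any two words may form a compound word and that a word may be divided at any character position. Real hyphenation permits fewer division points, so the counts are an upper bound, not a measurement.

```python
from itertools import product


def enumerate_extra_entries(vocab):
    """Enumerate the additional dictionary entries needed to cover
    two-word compound words and both portions of separated words."""
    # Every ordered pair of words joined by a hyphen.
    compounds = {f"{a}-{b}" for a, b in product(vocab, repeat=2)}
    # Every division of a word into a line-end portion (with hyphen)
    # and a line-start portion.
    fragments = set()
    for word in vocab:
        for i in range(1, len(word)):
            fragments.add(word[:i] + "-")
            fragments.add(word[i:])
    return compounds, fragments


vocab = ["up", "to", "date"]
compounds, fragments = enumerate_extra_entries(vocab)
print(len(vocab), len(compounds), len(fragments))
# prints 3 9 10
```

Under these assumptions, the number of two-word compound entries grows with the square of the vocabulary size, and compounds of three or more words grow faster still.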
In this manner, a conventional OCR cannot correctly segment compound words or separated words as words to be spelling-checked. For this reason, even when the spelling checking function is executed, the character recognition results cannot be reliably confirmed and corrected.